Echo control device with quick response to sudden echo-path change

ABSTRACT

An Acoustic Echo Control (AEC) device operates to reduce acoustic echo feedback in a speakerphone. An adaptive echo canceller responds to a far-end speech signal to generate an estimated echo, which is subtracted from a near-end speech signal to generate a compensated near-end speech signal. An Echo Return Loss (ERL) estimator provides an accurate but gradually adjusted estimate of the ERL of the echo canceller. A non-linear processor responds to the ERL estimation and provides additional attenuation to maintain the overall system ERL. The AEC device also incorporates dual talking mode detectors, one for the adaptive filter and one for the non-linear processor, to detect a plurality of talking modes, which are used to control the adaptation of the adaptive filter and the attenuation provided by the non-linear processor. A convergence indicator responds to sudden echo-path changes and quickly corrects the ERL estimation, thereby adjusting the talking mode detectors and the non-linear processor accordingly to maintain the overall system ERL and stability.

BACKGROUND OF THE INVENTION

The present invention relates generally to voice communication systems and, more particularly, to an electronic system for reducing acoustic echo. Speakerphones, which employ one or more microphones together with one or more speakers, enable "hands-free" telephone communication as well as participation in a conversation by a number of persons. Modern speakerphones are capable of operating in a variety of modes, which include single-talking mode, in which voice information is transmitted in a single direction, and double-talking mode, in which voice information is transmitted by both sides. Double-talking mode increases the interactivity of the conversation, but also causes a phenomenon known as "acoustic feedback echo", in which acoustic energy transmitted by the speaker of the speakerphone is picked up by the microphone of the same speakerphone. Typically, speakerphones utilize an Acoustic Echo Control (AEC) device to reduce this echo by generating an estimate of the expected feedback ("acoustic echo") between the speaker and the microphone, and subtracting the expected echo from the signal produced by the microphone (the "near-end signal") before transmission of the signal to a remote communications station. Generally, the AEC device is adaptive in the sense that changes in the acoustic echo path are accounted for in generating the estimated echo.

Typical AEC devices consist of an echo canceller filter cascaded with a non-linear processor. The echo canceller filter generates a linearly corrected near-end signal, and the non-linear processor, in conjunction with a talking mode detector which detects various talking modes (single-talking, double-talking, etc.), provides additional echo attenuation for certain talking modes. The additional attenuation provided by the non-linear processor increases the echo cancellation performance, also known as the echo return loss enhancement, but reduces the degree of double-talking operation and therefore the interactivity of the conversation. Thus a typical echo control device strikes a compromise between interactivity and echo return loss performance.

Under steady state conditions, the echo canceller converges to very nearly cancel the echo, tracking only gradual changes to avoid instability. Sudden changes occurring in the echo path, such as a relative repositioning of the speaker and microphone, disturb the system. Typical echo cancellers and non-linear processors are slow to respond to a sudden change in the echo path. Thus, when such a change happens, the system performance, such as echo return loss enhancement and stability, is significantly degraded until the system eventually re-converges.

To respond to sudden changes in the echo path, some AEC devices utilize a convergence detector to monitor the convergence of the echo canceller filter. Such detectors rely on the principle that the adaptive filter will diverge when sudden echo path changes occur. The degree of convergence of the filter can be detected by examining the cross-correlation between the estimated echo and estimation error. By the principle of orthogonality, the cross-correlation should be nearly zero when the filter is converged. However, for a practical environment, and especially in the presence of double-talk (double-talking operation), false divergence detection is frequent. This is because speech from independent sources (near end, far end) has similar spectral and temporal characteristics, and detectors which employ short-term estimation tend to predict a non-zero cross correlation. Only if averaged for a substantial time can the cross-correlation be guaranteed to approach zero. Thus, a convergence detector based solely on cross-correlation is necessarily a compromise between accuracy and response time.

SUMMARY OF THE INVENTION

In a principal aspect, the present invention takes the form of an acoustic echo device which exhibits a high degree of stability and provides quick and accurate response to sudden changes in the echo path. The acoustic echo device includes an adaptive echo canceller which adaptively modifies a near-end speech signal to cancel an acoustic feedback echo component in the signal to generate a modified near-end signal, the acoustic feedback being generated by a far-end speech signal received by the device. An echo return loss estimator provides an estimate of the echo return loss of the echo canceller. A means, responsive to the estimated echo return loss, proportionally attenuates the modified near-end speech signal to substantially cancel acoustic feedback contained in the modified near-end signal.

Thus, an object of the present invention is to provide a rescue device to detect sudden echo path changes and to adjust the non-linear processor accordingly to compensate for such changes. A further object is to provide a reliable filter convergence detector for the various talking modes by introducing an energy normalization factor. It is a further object to provide a unified mechanism to react to a sudden echo-path change. It is a further, more specific object to be able both to respond quickly to a true detection and to recover quickly from a false detection.

In additional aspects, the acoustic echo control device employs a convergence detector which is responsive to sudden echo-path changes and which corrects the estimated echo return loss of the adaptive echo canceller to maintain the overall echo return loss and system stability.

These and other objects, features, and advantages of the present invention are discussed or apparent in the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred embodiment of the present invention is described herein with reference to the drawings wherein:

FIG. 1 is a block diagram of an electronic system utilizing the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 of the drawings shows a block diagram of a preferred AEC device 100, for use in a speakerphone, seen generally at 101, to reduce acoustic feedback echo, seen schematically at 105, from speaker 108 to microphone 104. As seen in FIG. 1, the speakerphone 101 may be coupled via the Public Switched Telephone Network (PSTN) 102 to a remote station 103 which may take the form of a conventional telephone or a speakerphone. The speakerphone 101 includes a microphone 104 for detecting acoustical energy to produce a near-end speech signal 106 and a speaker 108 for generating acoustical energy in accordance with far-end speech signal 110. The far-end speech signal 110 is generated by module 112 which digitizes the signal received from the remote station 103 and cancels line echo if such echo exists. Module 114 digitizes the incoming signal from microphone 104 to generate the near-end speech signal 106.

Far-end speech signal 110 is adjusted at 146, in accordance with output generated by speaker volume control module 144, to compensate for dynamic characteristics of speaker 108. The resulting volume compensated signal 113 is used to drive speaker 108 and as an input to far-end speech energy estimator 136 to detect energy in the far-end speech signal 110. The far-end speech energy estimate provided by estimator 136 is used by talking mode detectors 116 and 118, and by far-end noise floor estimator 140. A near-end speech energy estimator 138 responds to compensated signal 126 to provide a near-end speech energy estimate for use by near-end noise floor estimator 148, talking mode detectors 116 and 118, and echo return loss estimator 142. Estimators 136 and 138 preferably operate by taking the root mean square, or magnitude, of the respective input signal and applying a low-pass filter, which produces a value indicative of the energy contained in the respective input signal.
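The magnitude-plus-low-pass operation described for estimators 136 and 138 can be sketched as follows. This is a minimal illustration only: the one-pole filter structure, the class name EnergyEstimator and the smoothing coefficient alpha are assumptions and are not taken from the program listing.

```python
class EnergyEstimator:
    """Illustrative energy estimator: magnitude followed by a one-pole low-pass filter."""

    def __init__(self, alpha: float = 0.01) -> None:
        # alpha is a hypothetical smoothing coefficient, 0 < alpha <= 1.
        # Smaller values give a longer averaging time.
        self.alpha = alpha
        self.energy = 0.0

    def update(self, sample: float) -> float:
        # Move the running estimate a fraction alpha toward |sample|.
        self.energy += self.alpha * (abs(sample) - self.energy)
        return self.energy
```

One instance would track the volume compensated signal 113 and a second instance the compensated signal 126, each updated once per sample.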

Dual talking mode detectors 116 and 118 detect an operational mode from a group of operational modes which include: (1) quiescent mode, in which neither speakerphone 101 nor remote station 103 is transmitting; (2) transmission mode, in which speakerphone 101 is transmitting and remote station 103 is quiescent; (3) receive mode, in which remote station 103 is transmitting and speakerphone 101 is quiescent; and (4) double-talking mode, in which both speakerphone 101 and remote station 103 are transmitting. Talking mode detectors 116 and 118 both preferably operate in a manner described further below.

Separate talking mode detectors for the non-linear processor 128 and the adaptive filter 120 are advantageously employed in order to provide for the different requirements of the non-linear processor and the adaptive filter. It is desirable to stop the adaptation of the filter 120 while the AEC 100 is operated in either quiescent or double-talking mode when the ratio of the near-end signal and far-end signal is greater than a predetermined value. In order to do so accurately, the mode detection for the adaptive filter 120 requires different characteristics than that for the non-linear processor 128. Ideally, the non-linear processor requires quick response to reduce the cut-off of the conversation occurring between device 101 and remote station 103. It is also less tolerant of false detection of double-talking mode. A talking mode detector for the adaptive filter, on the other hand, requires a third mode for the double-talking mode with low near-end speech energy. In this mode, the adaptive filter is still able to converge with a smaller update gain. Thus, it is difficult to optimize a single talking mode detector for both functions.

Talking mode detector 116 generates an output indicative of one of the four aforementioned talking modes to an adaptive filter 120, which generates an estimated echo signal 122 having characteristics approximating those of the actual echo signal seen at 105. The estimated echo signal 122 is subtracted at 124 from the near-end signal 106 to generate a first compensated signal 126.
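The signal flow through adaptive filter 120 and subtractor 124 can be sketched as below. The description does not name the adaptation algorithm, so the normalized LMS (NLMS) update used here is an assumption chosen for illustration, as are the class name AdaptiveEchoFilter and the parameters taps, mu and eps. The adapt flag stands in for the adaptation control supplied by talking mode detector 116.

```python
from typing import List


class AdaptiveEchoFilter:
    """Illustrative FIR echo canceller with an NLMS update (assumed algorithm)."""

    def __init__(self, taps: int, mu: float = 0.5, eps: float = 1e-8) -> None:
        self.w: List[float] = [0.0] * taps   # filter coefficients
        self.x: List[float] = [0.0] * taps   # far-end history, newest sample first
        self.mu = mu                         # hypothetical step size
        self.eps = eps                       # guards against division by zero

    def step(self, far: float, near: float, adapt: bool = True) -> float:
        # Shift the newest far-end (reference) sample into the history.
        self.x = [far] + self.x[:-1]
        # Estimated echo, corresponding to signal 122.
        est = sum(wi * xi for wi, xi in zip(self.w, self.x))
        # Compensated signal, corresponding to signal 126.
        err = near - est
        if adapt:
            norm = sum(xi * xi for xi in self.x) + self.eps
            gain = self.mu * err / norm
            self.w = [wi + gain * xi for wi, xi in zip(self.w, self.x)]
        return err
```

Calling step with adapt=False freezes the coefficients while still producing the compensated output, which matches the "no adaptation" behaviour required in some talking modes.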

Talking mode detector 118 generates an output indicative of one of the four aforementioned talking modes to Non-Linear Processor (NLP) 128 which provides additional echo attenuation to first compensated signal 126. The output 130 of the NLP 128 is multiplied at multiplier 132 with the first compensated signal 126 to generate transmittable speech signal 134 which is converted into analog form by a digital-to-analog (D/A) converter in module 112 and transmitted over the PSTN 102 to remote station 103.

Signal 130 is advantageously generated by NLP 128 to proportionally attenuate signal 126 based on the estimated echo return loss (ERL) 149, as provided by module 142, of the adaptive filter 120. Thus, when the adaptive filter 120 does not provide sufficient attenuation, for example during an initial condition or sudden echo-path change, then the output of the non-linear processor changes to increase the attenuation of signal 126. The enhancement of the overall system stability and echo return loss are thus maintained at the cost of lower interactivity. When the filter 120 converges, the output of the nonlinear processor changes to decrease the attenuation of signal 126 in order to increase the interactivity of the conversation. To achieve an optimal trade-off between ERL and interactivity, the additional attenuation provided by the nonlinear processor is advantageously determined by the estimated ERL 149 of the adaptive filter 120 and the talking mode detected by the detector 118. The echo return loss estimator 142 determines the estimated echo return loss by detecting the valley, using a valley detector, of the ratio between the signal 126 and reference signal, seen at 113. The following table shows the operation of the talking mode detector, NLP and adaptive filter for the various talking modes of the speakerphone:

    ______________________________________________________________________________________________
    Talking Mode         Qualification                Nonlinear Processor     Adaptation
                                                      Attenuation (Gnlp)      Control
    ______________________________________________________________________________________________
    Quiescent Mode       Ens < Lnf + Tln and          Fixed attenuation to    No adaptation
                         Efs < Fnf + Tfn              maintain the system
                                                      stable
    Transmitting Mode    Ens > Lnf + Tln and          No attenuation          No adaptation
                         Efs < Fnf + Tfn and
                         (Ens - Lnf)/(Efs - Fnf) >
                         Lerl + Terl
    Receiving Mode       Efs > Fnf + Tfn and          Gnlp = Gdesired/Lerl    Adaptation
                         (Ens - Lnf)/(Efs - Fnf) <
                         Lerl + Terl
    Double-Talking Mode  Efs > Fnf + Tfn and          No attenuation          If Efs/Ens > Ttrain:
                         Ens > Lnf + Tln and                                  adaptation.
                         (Ens - Lnf)/(Efs - Fnf) >                            Else:
                         Lerl + Terl                                          no adaptation
    ______________________________________________________________________________________________

Ramp-up and Ramp-down functions are advantageously employed to smooth the non-linear processor attenuation transition. The abbreviations used in the table above are understood to have the following meanings:

Efs: far-end speech energy estimation, generated by estimator 136

Ens: near-end speech energy estimation, generated by estimator 138

Lnf: near-end noise floor, generated by estimator 148

Fnf: far-end noise floor, generated by estimator 140

Lerl: near-end echo return loss estimation, generated by estimator 142

Gdesired: the desired echo return loss, predetermined

Gnlp: Additional gain by the non linear processor, generated by NLP 128.

Ttrain: the maximum SNR for the adaptive filter to converge, predetermined

Tln, Tfn, Terl: empirically determined values.

In the table above, the column labelled "Talking Mode" shows the different talking modes of the speakerphone. The column labelled "Qualification" shows how each of the modes in the talking-mode column is determined. The column labelled "Nonlinear Processor Attenuation (Gnlp)" shows the manner in which the NLP provides attenuation for each of the talking modes. The column labelled "Adaptation Control" shows the manner in which adaptation of the adaptive filter is controlled.

As seen in the table, the speakerphone is determined to be in quiescent mode if the near-end speech energy estimation is less than the near-end noise floor and if the far-end speech energy estimation is less than the far-end noise floor. In quiescent mode, the NLP provides a fixed amount of attenuation to maintain system stability, and no adaptation is performed in the adaptive filter. The speakerphone is determined to be in transmitting mode if the near-end speech energy estimation is greater than the near-end noise floor and if the far-end speech energy estimation is less than the far-end noise floor and if the ratio of the difference between the near-end speech energy estimation and the near-end noise floor to the difference between the far-end speech energy estimation and the far-end noise floor is greater than the near-end echo return loss estimation. In each comparison, the noise floor or echo return loss estimation is offset by the corresponding empirically determined value (Tln, Tfn or Terl) shown in the table. In transmitting mode (which is a single-talking mode), the NLP provides no attenuation and no adaptation is performed in the adaptive filter.

The speakerphone is determined to be in receiving mode if the far-end speech energy estimation is greater than the far-end noise floor and if the ratio of the difference between the near-end speech energy estimation and the near-end noise floor to the difference between the far-end speech energy estimation and the far-end noise floor is less than the near-end echo return loss estimation. In the receiving mode (which is a single-talking mode), the NLP provides attenuation (Gnlp) as a function of the ratio of the desired echo return loss to the near-end echo return loss estimation, and the adaptive filter is in adapt mode. The speakerphone is determined to be in double-talking mode if the far-end and near-end speech energy estimations are each greater than their respective floors and if the ratio described above for the transmitting and receiving modes is greater than the near-end echo return loss estimation. In the double-talking mode, the NLP provides no attenuation. The adaptive filter adapts if the ratio of the far-end speech energy estimation to the near-end speech energy estimation is greater than the predetermined maximum signal-to-noise ratio (SNR) for the adaptive filter to converge.
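The qualification, attenuation and adaptation rules of the table can be sketched as follows. The function and class names are hypothetical. Three details are assumptions that the table does not specify: the fallback to quiescent mode for combinations the table leaves uncovered, the treatment of exact equality at a threshold, and the clamping of Gnlp to at most 1.0. The default fixed_gain is a placeholder value only.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    QUIESCENT = "quiescent"
    TRANSMITTING = "transmitting"
    RECEIVING = "receiving"
    DOUBLE_TALK = "double-talking"


@dataclass
class Thresholds:
    Tln: float     # near-end threshold offset
    Tfn: float     # far-end threshold offset
    Terl: float    # echo return loss threshold offset
    Ttrain: float  # limit on Efs/Ens for adaptation during double-talk


def classify(Ens: float, Efs: float, Lnf: float, Fnf: float,
             Lerl: float, th: Thresholds) -> Mode:
    near_active = Ens > Lnf + th.Tln
    far_active = Efs > Fnf + th.Tfn
    denom = Efs - Fnf
    # Assumption: with no far-end energy above the floor, treat the ratio as infinite.
    ratio = (Ens - Lnf) / denom if denom > 0 else math.inf
    above_erl = ratio > Lerl + th.Terl
    if not near_active and not far_active:
        return Mode.QUIESCENT
    if near_active and not far_active and above_erl:
        return Mode.TRANSMITTING
    if far_active and not above_erl:
        return Mode.RECEIVING
    if far_active and near_active and above_erl:
        return Mode.DOUBLE_TALK
    # Assumption: combinations not covered by the table fall back to the
    # most conservative mode (fixed attenuation, no adaptation).
    return Mode.QUIESCENT


def nlp_gain(mode: Mode, Lerl: float, Gdesired: float,
             fixed_gain: float = 0.25) -> float:
    if mode is Mode.QUIESCENT:
        return fixed_gain              # fixed attenuation (placeholder value)
    if mode is Mode.RECEIVING:
        if Lerl <= 0:
            return 1.0
        # Gnlp = Gdesired / Lerl, clamped so the NLP never amplifies (assumption).
        return min(1.0, Gdesired / Lerl)
    return 1.0                         # transmitting and double-talking: no attenuation


def adapt_enabled(mode: Mode, Ens: float, Efs: float, Ttrain: float) -> bool:
    if mode is Mode.RECEIVING:
        return True
    if mode is Mode.DOUBLE_TALK:
        return Efs > Ttrain * Ens      # equivalent to Efs/Ens > Ttrain for Ens > 0
    return False
```

In the described device two detectors (116 and 118) run separately for the adaptive filter and the NLP; the sketch uses one classification function for both purely to keep the example short.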

The convergence detector 121 is a cross-correlation type convergence detector which employs an energy normalization factor to increase the reliability of the detection. The convergence detector generates a Filter Convergence Indication (FCI) which is indicative of the convergence of the adaptive filter 120 in accordance with the relationship shown in equation (1) below:

    FCI = L[e'(n)*o(n)] / (L[|e'(n)|]*L[|d(n)|])     (1)

where L is a low-pass filter; e'(n) is the estimated echo 122 generated by the adaptive filter 120;

o(n) is the signal 126 generated by difference module 124; and

d(n) is the near-end signal 106.

A low-pass filter, seen in equation (1) as "L", is advantageously employed to approximate the relationship shown in equation (2) below which describes the convergence of the adaptive filter 120: ##EQU2## As seen in equation (1), the absolute value is used to approximate the mean-square calculation performed in the equation (2). The numerator of equation (2) is the cross-correlation of the estimated echo signal 122 and the residual echo left uncancelled by the estimated echo signal 122. The denominator of the equation (2) normalizes the cross-correlation.
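Equation (2) is represented only by a placeholder in this text. One reading consistent with the description above (a cross-correlation in the numerator, normalized by mean-square terms which equation (1) approximates with low-pass filtered absolute values) is sketched below. The expectation operator E[.] and the square root are assumptions made for illustration and are not the original typeset equation; the left-hand form is the one recited in claim 4.

```latex
\mathrm{FCI}(n)
  \;=\; \frac{L\big[e'(n)\,o(n)\big]}{L\big[\,|e'(n)|\,\big]\; L\big[\,|d(n)|\,\big]}
  \;\approx\;
  \frac{E\big[e'(n)\,o(n)\big]}{\sqrt{E\big[e'(n)^{2}\big]\; E\big[d(n)^{2}\big]}}
```

Under this reading the low-pass filter L plays the role of a short-term average, and each low-passed magnitude stands in for the square root of the corresponding mean-square term.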

Utilizing the mean square of the near-end signal 106 employed in the denominator of the relationship affords at least two advantages. First, when the filter is nearly converged, the residual echo, defined as the difference between the actual echo signal 105 and the estimated echo signal 122, is relatively small compared to the near-end speech signal 106. Under such a condition, utilization of the near-end signal to normalize the cross-correlation reduces the sensitivity of the detector to the presence or absence of near-end speech. Thus, false detection in the presence of near-end speech is reduced. A second advantage to using the mean square of the near-end signal 106 is that sensitivity of the convergence detector to divergence of the adaptive filter 120 is reduced, by making the denominator of equation (2) independent of the degree of filter convergence. In the implementation of the relationship shown in equation (2), by the low-pass operator shown in equation (1), the time constant of the low-pass filter ranges from 64 milliseconds to 512 milliseconds for 8 kHz sampled speech signals.
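The FCI relationship of equation (1), in the form recited in claim 4, can be sketched per sample as follows. The one-pole realization of the low-pass filter L, the mapping from time constant to coefficient (alpha = 1 / (time constant x sample rate)), the eps guard and all names are assumptions made for illustration; only the 64 to 512 millisecond range and the 8 kHz sample rate come from the description.

```python
class LowPass:
    """One-pole low-pass filter (an assumed realization of the operator L)."""

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.y = 0.0

    def update(self, x: float) -> float:
        self.y += self.alpha * (x - self.y)
        return self.y


class ConvergenceDetector:
    """Illustrative FCI = L[e'*o] / (L[|e'|] * L[|d|])."""

    def __init__(self, time_constant_ms: float = 128.0,
                 sample_rate_hz: float = 8000.0, eps: float = 1e-12) -> None:
        # Approximate coefficient for the requested time constant,
        # e.g. 64 ms at 8 kHz is 512 samples, giving alpha = 1/512.
        self.alpha = 1.0 / (time_constant_ms * 1e-3 * sample_rate_hz)
        self.lp_cross = LowPass(self.alpha)   # L[e'(n) * o(n)]
        self.lp_echo = LowPass(self.alpha)    # L[|e'(n)|]
        self.lp_near = LowPass(self.alpha)    # L[|d(n)|]
        self.eps = eps                        # avoids division by zero at start-up

    def update(self, est_echo: float, compensated: float, near: float) -> float:
        # est_echo is e'(n) (signal 122), compensated is o(n) (signal 126),
        # near is d(n) (signal 106).
        num = self.lp_cross.update(est_echo * compensated)
        den = self.lp_echo.update(abs(est_echo)) * self.lp_near.update(abs(near))
        return num / (den + self.eps)
```

Because all three filters share one coefficient, the numerator and both denominator terms are averaged over the same window, which keeps the ratio meaningful as the signal level changes.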

In addition to generating the filter convergence indication, which provides a quantitative indication of the convergence of the adaptive filter, the convergence detector also generates a divergence indication which provides a qualitative indication of the divergence of the adaptive filter. The divergence indication is generated by comparing the FCI value to a predetermined threshold value. If the FCI value is greater than the threshold value then the divergence value is set to a value to indicate that the adaptive filter has not diverged. If the FCI value is less than or equal to the threshold value then the divergence value is set to a value to indicate that the adaptive filter has diverged.

If the adaptive filter is determined to have diverged, then the echo return loss estimator 142 is set to an estimated echo-return-loss value corresponding to an empirically determined maximum echo return loss which is indicative of the acoustic coupling between the speaker 108 and the microphone 104. This value is used to adjust the talking mode detectors 116 and 118 and the NLP 128. This mechanism provides two advantages. First, it provides quick self-recovery from false detection. Since the divergence detection only affects the echo return loss estimator value, the echo return loss estimator quickly recovers to the actual estimation in the case of false detection. Second, the mechanism provides quick and accurate response. The primary consequence of a sudden echo path change is the divergence of the adaptive filter, which leads to a significant degradation in the ability of the adaptive filter to effectively cancel acoustic echo. Ideally, the talking mode detectors and the NLP should react quickly and correctly to the change in the echo return loss of the adaptive filter. The mechanism described above advantageously sets the estimated echo-return-loss value quickly to a value which is indicative of its worst case, and the value is then allowed to decay quickly to the actual level. The echo return loss estimator 142, which is a valley detector, provides rapid decay, and accordingly, quick reduction of the echo return loss estimation to the appropriate level. A self-recovery mechanism is provided if a false divergence detection is made, or if a divergence detection is missed. In such a situation, the echo return loss estimator will correct itself by detecting the valley of the ratio between the modified near-end signal 126 and the reference signal 113.
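The divergence test and the reset-then-decay behaviour of the echo return loss estimator can be sketched as follows. The asymmetric smoothing used to realize the valley detector, the coefficients fall and rise, and the names ErlEstimator and is_diverged are assumptions made for illustration; the description states only that the estimator detects the valley of the ratio and decays rapidly.

```python
def is_diverged(fci: float, threshold: float) -> bool:
    # Per the description: diverged when FCI is less than or equal to the threshold.
    return fci <= threshold


class ErlEstimator:
    """Illustrative valley detector for the ratio of near-end to far-end energy."""

    def __init__(self, max_erl: float = 1.0, fall: float = 0.5,
                 rise: float = 0.001) -> None:
        self.max_erl = max_erl  # empirically determined worst-case value (placeholder)
        self.fall = fall        # hypothetical fast coefficient toward lower ratios
        self.rise = rise        # hypothetical slow coefficient toward higher ratios
        self.erl = max_erl

    def update(self, near_energy: float, far_energy: float,
               diverged: bool = False) -> float:
        if diverged:
            # Jump to the worst case; later updates decay quickly to the true level.
            self.erl = self.max_erl
            return self.erl
        if far_energy <= 0.0:
            return self.erl     # no reference energy, keep the previous estimate
        ratio = near_energy / far_energy
        coeff = self.fall if ratio < self.erl else self.rise
        self.erl += coeff * (ratio - self.erl)
        self.erl = min(self.erl, self.max_erl)
        return self.erl
```

A false divergence indication costs only a brief excursion, since the fast downward coefficient pulls the estimate back to the observed valley within a few updates.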

The structure shown in FIG. 1 is for illustration purposes only. Preferably the functional modules seen in FIG. 1 are implemented by a microprocessor operating under stored program control. The microprocessor makes use of a Read-Only Memory (ROM) for storage of control programs and permanent data and Random Access Memory (RAM) for temporary storage of programs and data. Input/Output (I/O) circuitry is incorporated either in the microprocessor or in peripheral chips for receiving and transmitting information to and from the microprocessor.

Although the foregoing description of the preferred embodiment will enable a person of ordinary skill in the art to make and use the invention, a detailed assembly language subroutine listing of a preferred program which may be executed by the microprocessor to implement the above described operations is listed below. In the listing below, the code at line numbers 226-254, 352-399, and 457-515 corresponds to the adaptive filter 120. The code at line numbers 264-338 corresponds to the convergence detector 121. The code at line numbers 339-351 and 408-443 corresponds to the echo return loss estimator 142 and the energy floor estimators 136 and 138. The code at line numbers 523-689 corresponds to the nonlinear processor 128. The code for the talking mode detectors is implemented as conditional execution code in the adaptive filter and nonlinear processor code.

The program listed below may be converted into machine executable form using the TMS 320C2/C5x Assembler available from the Texas Instruments Corporation. Additional detailed features of the system will become apparent to those skilled in the art from reviewing the program.

A preferred embodiment of the present invention has been described herein. It is to be understood, of course, that changes and modifications may be made in the embodiment without departing from the true scope and spirit of the present invention, as defined by the appended claims. For instance, although the specific embodiment described above takes the form of an acoustic echo device for a speakerphone, application of the principles described herein, to other types of devices, such as hybrid echo cancellers, will be apparent to those skilled in the art. ##SPC1## 

What is claimed is:
 1. An acoustic echo control device that suppresses echoes, comprising: a first talking mode detector for providing a first signal indicative of one of three talking modes; an adaptive filter responsive to said first signal, said adaptive filter generating a replica of the echo component of a near-end signal; a summer device that subtracts the replica of the echo component of the near-end signal from the near-end signal for producing a modified near-end signal; a second talking-mode detector for providing a second signal indicative of one of three talking modes; and a nonlinear attenuator wherein said nonlinear attenuator provides one of no attenuation, attenuation that adjusts to achieve a desired echo return loss, and a fixed amount of attenuation, of the modified near-end signal in response to said second talking-mode detector.
 2. An acoustic echo control device of claim 1 further comprising: a convergence detector for producing a convergence indicator in response to the near-end signal, the modified near-end signal, and the echo replica.
 3. An acoustic echo control device of claim 2 wherein said convergence detector is a cross-correlation type convergence detector that employs an energy normalization factor.
 4. An acoustic echo control device of claim 2 wherein said convergence indicator is determined in accordance with the equation:

    FCI=L[e'(n)*o(n)]/(L[|e'(n)|]*L[|d(n)|])

where, L is a low-pass filter; e'(n) is the echo replica generated by the adaptive filter; o(n) is the modified near-end signal; and d(n) is the near-end signal.
 5. An acoustic echo control device of claim 2 wherein said convergence detector further comprises a correlator for generating a correlation of said echo replica and said modified near-end signal.
 6. An acoustic echo control device of claim 5 wherein said convergence detector further comprises a normalizer for normalizing said correlation with respect to said echo replica.
 7. An acoustic echo control device of claim 5 wherein said normalizer also normalizes said correlation with respect to said modified near-end signal.
 8. An acoustic echo control device of claim 5 wherein said normalizer further comprises a second correlator for generating a second correlation of said echo replica and said modified near-end signal, wherein said normalizer normalizes said correlation with respect to said second correlation.
 9. An acoustic echo control device of claim 2 wherein said nonlinear attenuator provides attenuation that adjusts to achieve a desired echo return loss in response to said convergence indicator in response to a predetermined level of echo return loss.
 10. An acoustic echo control device of claim 2 further comprising a near-end signal energy detector and a far-end signal energy detector, wherein said nonlinear attenuator is responsive to said far-end and near-end signal energy detectors.
 11. An acoustic echo control device of claim 2 wherein said convergence indicator is normalized with respect to the near-end signal.
 12. A method for suppressing the echo component of a near-end signal, comprising the steps of: generating a replica of an echo component of a near-end signal; subtracting the replica of the echo component of a near-end signal from the near-end signal to produce a modified near-end signal; producing, from first and second talking mode detectors, first and second talking mode signals indicative of one of three talking modes; adapting filter coefficients in response to said first talking mode signal; and attenuating the modified near-end signal in response to said second talking mode signal, wherein the attenuation is fixed, adjusted to achieve a desired echo return loss, or nonexistent, depending on the talking mode that is indicated by said second talking mode signal.
 13. The method of claim 12 further comprising the step of: producing a convergence indicator in response to the near-end signal, the echo replica, and the modified near-end signal.
 14. The method of claim 13 wherein the step of generating the replica of the echo component of the near-end signal includes using an adaptive filter.
 15. The convergence indicator of claim 14 wherein the indicator is indicative of convergence of the adaptive filter in accordance with the equation:

    FCI=L[e'(n)*o(n)]/(L[|e'(n)|]*L[|d(n)|])

where, L is a low-pass filter; e'(n) is the echo replica generated by the adaptive filter; o(n) is the modified near-end signal; and d(n) is the near-end signal.
 16. The method of claim 13 wherein the step of producing a convergence indicator is performed using a cross-correlation type convergence detector that employs an energy normalization factor.
 17. The method of claim 13 wherein the convergence indicator is normalized with respect to the near-end signal.
 18. The method of claim 12 wherein the step of producing the convergence indicator comprises the step of correlating the echo replica with the modified near-end signal.
 19. The method of claim 18 wherein the step of producing the convergence indicator comprises the step of normalizing the correlation with respect to the echo replica.
 20. The method of claim 18 wherein the step of producing the convergence indicator comprises the step of normalizing the correlation with respect to the near-end signal.
 21. The method of claim 18 wherein the step of producing the convergence indicator comprises the step of normalizing the correlation with respect to the correlation of the echo replica and the near-end signal.
 22. The method of claim 12 wherein the step of attenuating is performed in response to a predetermined level of echo return loss. 